Libraries
The following analysis makes use of the tidyverse suite of tools, along with help from quanteda and qdap for tokenization and cleaning.
Corpus Overview
The Legolas corpus is a subset of a much larger corpus. For the sake of text processing, all non-English texts have been removed from the Legolas corpus. That said, non-English words may occasionally be interspersed within these texts.
Corpus word count distribution
The corpus consists largely of shorter works. In the fanfiction world this is called “fluff.”
Histogram of word count
#keep English works of at least 10 words for the histogram
fanfiction_word_histogram <- fanfiction_names_df %>%
  filter(Language == "English", Word_count >= 10) %>%
  mutate(ID = as.integer(ID)) %>%
  select(ID, Word_count)

The histogram reveals that around 60% of the works are 5,000 words or fewer, and a full 16% of works have fewer than 1,000 words.
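The histogram itself can then be drawn from this tibble; a minimal sketch, assuming ggplot2 and an illustrative bin width of 1,000 words:

```r
library(ggplot2)

fanfiction_word_histogram %>%
  ggplot(aes(x = Word_count)) +
  geom_histogram(binwidth = 1000) +  #bin width chosen for illustration
  labs(x = "Word count", y = "Number of works",
       title = "Distribution of word counts across the corpus")
```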
Case Study 1: Readability
Assumption: These works are easy to read.
Thesis: Readability and popularity are inversely correlated; as texts become more complex, their popularity goes down.
Data Creation: Readability
First, generate an overview of the syntactical and lexical complexity. This can be done with the help of the quanteda package; see here for reference.
Computation: Chunking readability scores
The following series of functions takes fanfiction_df and turns it into a corpus object using only ID and Content as variables. The rest of the data frame is set aside for later joining.
#get list of works that are not in English or are very short
fanfiction_not_english <- fanfiction_names_df %>%
  filter(Language != "English" | Word_count < 10) %>%
  mutate(ID = as.integer(ID)) %>%
  select(ID)

fanfiction_df_small <- fanfiction_df %>%
  distinct(ID, Content) %>%
  mutate(ID = as.integer(ID)) %>%
  anti_join(fanfiction_not_english, by = "ID")

The following loop runs through the corpus to fetch the readability scores. The computation had to be chunked because of memory constraints.
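A sketch of such a chunked loop, assuming the quanteda.textstats package and an illustrative chunk size of 500 works (tune to available memory):

```r
library(quanteda)
library(quanteda.textstats)
library(dplyr)

chunk_size <- 500  #assumed batch size, not from the original analysis
n <- nrow(fanfiction_df_small)
readability <- NULL

for (start in seq(1, n, by = chunk_size)) {
  end <- min(start + chunk_size - 1, n)
  #build a small corpus for this batch only, keeping ID as the document name
  chunk_corpus <- corpus(fanfiction_df_small[start:end, ],
                         docid_field = "ID", text_field = "Content")
  scores <- textstat_readability(chunk_corpus,
                                 measure = c("Flesch", "FOG"))
  readability <- bind_rows(readability, scores)
}
```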
The distribution of readability scores is fairly regular, but the tests contradict one another: depending on the measure, the same texts score as either very simple or very complex.
Principal Component Analysis
The following chart takes the readability scores and tests whether variance in them explains variance in the popularity variables. There does not appear to be any relationship, as the variance explained by PC1 and PC2 is low.
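A sketch of that check; the tibble and column names below are illustrative placeholders, not taken from the data:

```r
library(tidyverse)

#readability_popularity: hypothetical joined tibble of readability scores
#(e.g. Flesch, FOG) and popularity variables (e.g. Kudos, Hits)
pca_input <- readability_popularity %>%
  select(Flesch, FOG, Kudos, Hits) %>%
  drop_na() %>%
  scale()  #standardise so no variable dominates

pca_fit <- prcomp(pca_input)
summary(pca_fit)  #proportion of variance explained by PC1, PC2, ...
```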
Principal Component Analysis: Popularity and Relationships
There is no apparent relationship between pairing and popularity: works with specific pairings do not cluster together in terms of their popularity.
Scatter plot of word count and popularity
Box plot word count by year
The word counts do not suggest that texts are getting shorter. The interquartile range of text length steadily increases until about 2016 and then starts to decrease, but there is no clear pattern.
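The box plot can be produced along these lines, assuming a Year column in the work metadata:

```r
library(tidyverse)

#Year is assumed to be a column in fanfiction_names_df
fanfiction_names_df %>%
  ggplot(aes(x = factor(Year), y = Word_count)) +
  geom_boxplot() +
  labs(x = "Year", y = "Word count")
```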
How do pairings change over the years?
Pairings by year, in percent. The first year is distorted because there are very few stories.
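One way to compute and plot these shares, assuming Year and Pairing columns in the work metadata:

```r
library(tidyverse)

#share of each pairing per year; Year and Pairing columns are assumed
pairings_by_year <- fanfiction_names_df %>%
  count(Year, Pairing) %>%
  group_by(Year) %>%
  mutate(percent = 100 * n / sum(n)) %>%
  ungroup()

ggplot(pairings_by_year, aes(x = Year, y = percent, fill = Pairing)) +
  geom_col() +
  labs(y = "Share of works (%)")
```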
Content warnings by year
Popular tags
This chart shows the average popularity of the 15 most common tags. The size of each circle corresponds to the number of works carrying that tag.
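A sketch of such a chart; a long tibble with one row per (work, tag) pair is assumed, and its name is hypothetical:

```r
library(tidyverse)

#fanfiction_tags_df: hypothetical tibble with one row per (work, tag) pair,
#including a Kudos column per work
tag_popularity <- fanfiction_tags_df %>%
  group_by(Tag) %>%
  summarise(mean_kudos = mean(Kudos, na.rm = TRUE), tag_count = n()) %>%
  slice_max(tag_count, n = 15)  #keep the 15 most common tags

ggplot(tag_popularity,
       aes(x = reorder(Tag, mean_kudos), y = mean_kudos, size = tag_count)) +
  geom_point() +
  coord_flip() +
  labs(x = NULL, y = "Average kudos", size = "Tag count")
```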
Some tags are interesting in terms of engagement, which is calculated as the ratio of kudos to views; that is, how many people who read a work with a certain tag also leave kudos. The tag list is very noisy and needs further refinement.
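The ratio can be computed per tag along these lines; the tag-level tibble and its Kudos and Hits columns are assumptions:

```r
library(tidyverse)

#engagement as the ratio of kudos to views; fanfiction_tags_df is a
#hypothetical tibble with one row per (work, tag) pair
tag_engagement <- fanfiction_tags_df %>%
  group_by(Tag) %>%
  summarise(engagement = sum(Kudos, na.rm = TRUE) / sum(Hits, na.rm = TRUE)) %>%
  arrange(desc(engagement))
```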
Co-occurrence matrix of Tag and Pairing
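A sketch of the co-occurrence counts, assuming a work-level tibble (hypothetical name) with one Tag and one Pairing per row:

```r
library(tidyverse)

#cross-tabulate tags against pairings; fanfiction_tags_df is hypothetical
tag_pairing_cooc <- fanfiction_tags_df %>%
  count(Tag, Pairing) %>%
  pivot_wider(names_from = Pairing, values_from = n, values_fill = 0)
```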
Convert the tibble to a corpus object.
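One way to do this with quanteda; the name mycorpus matches the one used in the tokenisation code below:

```r
library(quanteda)

#build the corpus from the cleaned tibble, keeping ID as the document name
mycorpus <- corpus(fanfiction_df_small,
                   docid_field = "ID", text_field = "Content")
summary(mycorpus, n = 5)  #peek at the first five documents
```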
Get basic descriptive statistics of text complexity. This information will need to be filtered, as some of the texts are clearly not measured accurately or have significant parts not in English.
Join readability scores with author and work details.
## Join the tibbles back by author ID
# readability_text <- fanfiction_df %>%
#   select(!(Content)) %>%
#   left_join(readability, by = c("ID" = "document")) %>%
#   left_join(fanfiction_names_df)
#   #filter(Language == "English" | Word_count > 2000)

Possible analysis: correlate the reading scores with likes/kudos or comments.
Advanced analysis: run a PCA of the different variables.
Tokenizing
Converting text elements to tokens.
# Tokenisation
# tok <- tokens(mycorpus, what = "word",
# remove_punct = TRUE,
# remove_symbols = TRUE,
# remove_numbers = TRUE,
# remove_url = TRUE,
# remove_hyphens = FALSE,
# verbose = TRUE,
# include_docvars = TRUE)
# tok <- tokens_tolower(tok)
#
# tok <- tokens_select(tok, stopwords("english"), selection = "remove", padding = FALSE)

Measuring lexical diversity through tokens
# lexical_diversity <- dfm(tok) %>%
#   textstat_lexdiv(measure = "TTR")

TTR Table
# lexical_diversity <- lexical_diversity %>%
# mutate(document = as.numeric(document))
#
# lexical_diversity_table <- fanfiction_df %>%
# select(!(Content)) %>%
# left_join(lexical_diversity, by = c("ID" = "document")) %>%
#   left_join(fanfiction_names_df)

N-grams
Ngrams had to be created in chunks to account for the heavy memory usage.
ngram_df <- NULL
for (i in 1:nrow(fanfiction_df_small)) {
  temp <- fanfiction_df_small %>%
    filter(row_number() == i) %>%
    group_by(ID) %>%
    unnest_tokens(bigram, Content, token = "ngrams", n = 2) %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    filter(!word1 %in% stop_words$word) %>%
    filter(!word2 %in% stop_words$word) %>%
    mutate(bigram = paste(word1, word2, sep = " ")) %>%
    count(bigram, sort = TRUE) %>%
    top_n(n = 100, wt = n)
  ngram_df <- ngram_df %>%
    bind_rows(temp)
  if (i %% 250 == 0) {
    #progress message, added to make sure R wasn't freezing
    print(paste0("Now processing text ", i, " of ", nrow(fanfiction_df_small)))
  }
}

ngram_total <- ngram_df %>%
  group_by(bigram) %>%
  summarise(total_bigrams = sum(n))
ngram_top_thousand <- ngram_total %>%
  arrange(desc(total_bigrams)) %>%
  top_n(n = 1000, wt = total_bigrams)

ngram_top_fifty_table <- ngram_top_thousand %>%
  top_n(n = 50, wt = total_bigrams) %>%
  addHtmlTableStyle(col.rgroup = c("none", "#F5FBFF"),
                    pos.caption = "bottom") %>%
  htmlTable(caption = "Top 50 bigrams")
ngram_top_fifty_table

Distinct words
Measuring distinct words in a corpus is complex. Heaps’ law indicates that the number of unique words grows roughly as a power of the total word count, with an exponent of about one half: as texts get longer, new unique words appear at an ever-slower rate, because there are only so many words in the English language. A better measure of distinctiveness is tf-idf. The code below computes it for the entire corpus, but fetches a lot of false positives, because much of what it treats as “unique” is actually scraping errors. It is, however, a good way to figure out what will need to be cleaned from the corpus.
#Create a clean list of the count of each word in the corpus with contractions expanded and possessives stripped, i.e. remove 's.
book_words_clean <- fanfiction_df_small %>%
  unnest_tokens(word, Content)

book_words_clean_filtered <- book_words_clean %>%
  filter(!str_detect(word, "['’.]"))

book_words_punctuation <- book_words_clean %>%
  filter(str_detect(word, "['’`.]"))

book_words_punctuation <- book_words_punctuation %>%
  mutate(word = qdap::replace_contraction(word, sent.cap = FALSE))

book_words_punctuation_extended <- book_words_punctuation %>%
  mutate(word = str_remove_all(word, "['’.]")) %>%
  mutate(word = str_remove_all(word, " s")) %>%
  separate_rows(word, sep = " ")

book_words_final <- book_words_clean_filtered %>%
  bind_rows(book_words_punctuation_extended)

book_words_final <- book_words_final %>%
  count(ID, word, name = "nr_words", sort = TRUE)

Get the total number of words by work.
total_words <- book_words_final %>%
  group_by(ID) %>%
  summarize(total = sum(nr_words))

Add the totals to the book_words tibble.

book_words_total <- left_join(book_words_final, total_words, by = "ID")

Now calculate the tf-idf.
book_tf_idf <- book_words_total %>%
  bind_tf_idf(word, ID, nr_words)

TF_IDF table
This table needs to be joined back with the metadata to make sense of it.
Get the top 3 distinct words for each book. This is still a lot of data to look through, but it might give an indication as to what is important about each work.
Join the metadata back to the distinct words table.
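Both steps can be sketched from the tf-idf table above; the join assumes the metadata matches on ID:

```r
library(tidyverse)

#top 3 words by tf-idf for each work
distinct_words <- book_tf_idf %>%
  group_by(ID) %>%
  slice_max(tf_idf, n = 3, with_ties = FALSE) %>%
  ungroup()

#re-attach the work metadata; ID is coerced to integer to match
distinct_words_meta <- distinct_words %>%
  left_join(fanfiction_names_df %>% mutate(ID = as.integer(ID)), by = "ID")
```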
Distinct words of the corpus.
# distinct_words_total <- distinct_words %>%
#   ungroup() %>%
#   anti_join(stop_words) %>%
#   group_by(word) %>%
#   summarise(total_words = sum(nr_words)) %>%
#   arrange(desc(total_words))
#
# distinct_words_total_table <- distinct_words_total %>%
#   top_n(n = 50, wt = total_words)
#
# distinct_words_total_table %>%
#   addHtmlTableStyle(col.rgroup = c("none", "#F5FBFF"),
#                     pos.caption = "bottom") %>%
#   htmlTable(header = c("Word", "Total Occurrences"),
#             caption = "Top 50 unique words in the corpus by number of occurrences")